To test the effect of one variable on another, simple linear regression may be applied.
The fitted model may be expressed as $\hat{y} = \hat{\alpha} + \hat{\beta}x$, where $\hat{\alpha}$ is the estimated constant, $\hat{\beta}$ is the estimated coefficient, and $x$ is the explanatory variable.
Figure: Example, produced in R, of a fitted linear regression model.
Below is the linear regression output using R's built-in data set cars.
Notice that the output from the model may be divided into two main categories:
output that assesses the model as a whole, and
output that relates to the estimated coefficients for the model.
Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.490e-12
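This output is produced by fitting the model with lm() and printing its summary:

# Simple linear regression of stopping distance on speed,
# using the built-in cars data set
fit <- lm(dist ~ speed, data = cars)
summary(fit)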
Notice that there are four different sets of output (Call, Residuals, Coefficients, and the overall results); the Coefficients block contains results for both the constant $\hat{\alpha}$ (Intercept) and the estimated coefficient $\hat{\beta}$ of the speed variable.
The estimated coefficients describe the expected change in the dependent variable for a one-unit increase in the explanatory variable, holding everything else constant. In the output above, for example, each additional unit of speed increases the predicted stopping distance dist by about 3.93.
The standard error is a measure of the precision of an estimate and is used to construct the confidence interval.
Confidence intervals provide a range of values within which, at a set level of confidence, the true value of the parameter is expected to lie.
For example, if the confidence level is set at 95%, then the probability that an interval constructed in this way fails to cover the true value is 0.05.
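In R these intervals can be obtained directly from the fitted model object above:

confint(fit, level = 0.95)  # 95% confidence intervals for the intercept and slope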
The p-value is a probability, often interpreted as a percentage.
Specifically, the p-value indicates how often, given that the null hypothesis is true, one would observe an outcome at least as extreme as the one actually observed.
If your calculated p-value is 0.02, then under a true null hypothesis you would observe an outcome at least as extreme as yours 2% of the time.
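As a check, the p-value reported for the speed coefficient can be recomputed from its t-value and the residual degrees of freedom shown in the output:

2 * pt(-abs(9.464), df = 48)  # two-sided p-value, approximately 1.49e-12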
In the overall model assessment, the R-squared ($R^2$) is the explained variance divided by the total variance.
Generally a higher $R^2$ is better, but adding explanatory variables to a model never decreases $R^2$, which is why the adjusted $R^2$, which penalizes the number of estimated parameters, is also presented.
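One standard form of the adjustment, with $n$ observations and $p$ explanatory variables, is

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1},$$

which with the values in the output above ($R^2 = 0.6511$, $n = 50$, $p = 1$) gives $\bar{R}^2 \approx 0.6438$, matching the reported adjusted $R^2$.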
Lastly, the F-statistic is given.
Since the t-statistic is not appropriate for testing hypotheses involving two or more coefficients at once, the F-statistic must be applied.
The basic methodology is that it compares a restricted model, in which the coefficients have been fixed at certain values, to the unrestricted model.
The most common is the sum of squared residuals F-test.
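One standard form of this test, assuming $q$ restrictions, $n$ observations, and $k$ estimated parameters in the unrestricted model, is

$$F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k)},$$

where $SSR_r$ and $SSR_{ur}$ are the sums of squared residuals of the restricted and unrestricted models; under the null hypothesis $F$ follows an $F$-distribution with $q$ and $n-k$ degrees of freedom.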
In the one-way analysis of variance (ANOVA) the model can be written as $Y_{ij} = \mu_i + \epsilon_{ij}$, where $Y_{ij}$ is observation $j$ in treatment group $i$ and the $\mu_i$ are the parameters of the model, the means of the treatment groups.
The $\epsilon_{ij}$ are independent and follow a normal distribution with mean zero and constant variance $\sigma^2$, often written as $\epsilon \sim N(0, \sigma^2)$.
The ANOVA model can also be written in the form:
$Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$
where $\mu$ is the overall mean of all treatment groups and $\alpha_i$ is the deviation of the mean of treatment group $i$ from the overall mean.
The $\epsilon_{ij}$ follow a normal distribution as before.
The expected value of $Y_{ij}$ is $\mu_i$, as the expected value of the errors is zero, often written as $E[Y_{ij}] = \mu_i$.
In the rat diet experiment the model would be of the form:
$y_{ij} = \mu_i + \epsilon_{ij}$
where $y_{ij}$ is the weight gain of rat $j$ in diet group $i$, $\mu_i$ is the mean weight gain in diet group $i$, and $\epsilon_{ij}$ is the deviation of rat $j$ from the mean of its diet group.
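A minimal sketch of fitting this model in R, assuming a data frame ratdiet with a numeric column gain and a factor column diet (hypothetical names):

# One-way ANOVA: weight gain modeled by diet group
fit.diet <- aov(gain ~ diet, data = ratdiet)
summary(fit.diet)  # ANOVA table with the F-test for a diet effect

# Estimated group means mu_i (fit without an intercept)
coef(lm(gain ~ diet - 1, data = ratdiet))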
The simplest random effects model is the one-way layout, commonly written in the form
$y_{ij} = \mu + \alpha_i + \epsilon_{ij},$
where $j = 1, \ldots, J$ and $i = 1, \ldots, I$.
Normally one also assumes $\epsilon_{ij} \sim N(0, \sigma^2)$, $\alpha_i \sim N(0, \sigma_A^2)$, and that all of these random variables are independent.
Note that we no longer make a notational distinction between random variables and measurements (the $y$-values are simply treated as random variables wherever distributions appear).
Note that this is considerably different from the fixed effect model.
Since the factor has changed to a random variable with an expected value of zero, the expected value of all the $y_{ij}$ is the same:
$E[y_{ij}] = \mu$
The variance of $y_{ij}$ now has two components:
$\operatorname{Var}[y_{ij}] = \sigma_A^2 + \sigma^2$
In addition, there is a covariance structure between the measurements, and this needs to be looked at in some detail.
First, consider the general case of the covariance between two measurements $y_{ij}$ and $y_{i'j'}$, where the indices may or may not be the same:
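Under the model above, since the $\alpha_i$ and $\epsilon_{ij}$ are independent with mean zero, this covariance reduces to (a sketch of the standard result):

$$\operatorname{Cov}(y_{ij}, y_{i'j'}) = \operatorname{Cov}(\alpha_i, \alpha_{i'}) + \operatorname{Cov}(\epsilon_{ij}, \epsilon_{i'j'}) = \begin{cases} \sigma_A^2 + \sigma^2 & \text{if } i = i',\ j = j' \\ \sigma_A^2 & \text{if } i = i',\ j \neq j' \\ 0 & \text{if } i \neq i'. \end{cases}$$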
In the mixed effects model $y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}$, the $\mu$ and $\alpha_i$ are the fixed effects and the $\beta_j$ are the random effects.
Recall that in the simple one-way layout with $y_{ij} = \mu + \alpha_i + \epsilon_{ij}$, we can write the model in matrix form as $y = X\beta + \epsilon$, where $\beta = (\mu, \alpha_1, \ldots, \alpha_I)'$ and $X$ is chosen appropriately.
The same applies to the simplest random effects model $y_{ij} = \mu + \beta_j + \epsilon_{ij}$, which we can write as $y = \mu \cdot \mathbf{1} + ZU + \epsilon$, where $\mathbf{1} = (1, 1, \ldots, 1)'$ and $U = (\beta_1, \ldots, \beta_J)'$.
In general, we write mixed effects models in matrix form as $y = X\beta + ZU + \epsilon$, where $\beta$ contains the fixed effects and $U$ contains the random effects.
Since $U$ and $\epsilon$ are normally distributed, $y$ is multivariate normal with mean $X\beta$ and covariance matrix $\Sigma_y$. Therefore the likelihood function for the unknown parameters, $L(\beta, \sigma_A^2, \sigma^2)$, is
$$L(\beta, \sigma_A^2, \sigma^2) = \frac{1}{(2\pi)^{n/2} |\Sigma_y|^{1/2}}\, e^{-\frac{1}{2}(y - X\beta)' \Sigma_y^{-1} (y - X\beta)}$$
where $\Sigma_y = \sigma_A^2 ZZ' + \sigma^2 I$.
Maximizing $L$ over $\beta$, $\sigma_A^2$, and $\sigma^2$ yields estimates of the variance components and the fixed effects.
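As an illustration, this maximization is what packages such as lme4 in R perform; a minimal sketch, assuming a data frame df with a response y and a grouping factor group (hypothetical names):

library(lme4)
# Random intercept model y_ij = mu + beta_j + epsilon_ij:
# one random effect per level of the grouping factor.
# Note: lme4 maximizes a restricted likelihood (REML) by default;
# set REML = FALSE for ordinary maximum likelihood.
fit.mm <- lmer(y ~ 1 + (1 | group), data = df)
summary(fit.mm)  # reports the variance components and the fixed effect mu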
The predicted random effects $\hat{u}$ may also be needed; these are normally obtained using the best linear unbiased predictor (BLUP).
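In lme4, for example, the predicted random effects and the estimated fixed effects can be extracted from the fitted model above:

ranef(fit.mm)  # predicted random effects (the BLUPs) for each group
fixef(fit.mm)  # estimated fixed effects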